
    Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

    Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. Furthermore, we demonstrate that if such variations of test data are not eliminated, a 13B model can easily overfit a test benchmark and achieve drastically inflated performance, on par with GPT-4. We validate these observations on widely used benchmarks such as MMLU, GSM8K, and HumanEval. To address this growing risk, we propose a stronger LLM-based decontamination method and apply it to widely used pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. For example, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identified overlap with 8-18% of the HumanEval benchmark. Interestingly, we also find such contamination in synthetic datasets generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We urge the community to adopt stronger decontamination approaches when using public benchmarks, and we call for the community to actively develop fresh one-time exams to evaluate models accurately. Our decontamination tool is publicly available at https://github.com/lm-sys/llm-decontaminator.
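
    A minimal sketch of why exact n-gram matching misses rephrased test samples; the function names and example texts below are illustrative, and the paper's stronger detector instead asks an LLM to judge whether a pair is a rephrasing:

```python
# Illustrative sketch, not the authors' tool: exact n-gram matching
# reports "clean" for a paraphrased benchmark item that an LLM-based
# (or even embedding-based) detector would flag.

def ngrams(text: str, n: int = 10) -> set:
    """Word-level n-grams, the unit string-matching pipelines compare on."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_doc: str, n: int = 10) -> bool:
    """Flag contamination only when an exact n-gram is shared."""
    return bool(ngrams(train_doc, n) & ngrams(test_doc, n))

# A made-up GSM8K-style test item and a light paraphrase of it that
# could plausibly sit in a training corpus.
test_item = ("A farm's ducks lay 16 eggs per day. Three are eaten at "
             "breakfast and four are baked into muffins. How many eggs "
             "are left to sell at the market?")
train_item = ("Each day the ducks on a farm produce 16 eggs. Breakfast "
              "uses three and muffins use four. How many remain to be "
              "sold at the market?")

print(is_contaminated(train_item, test_item))  # False: the paraphrase slips through
```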

    On Optimal Caching and Model Multiplexing for Large Model Inference

    Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both offline and online settings. Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to a 50× improvement when the ratio between the maximum cost and minimum cost is 100. Experiments on real datasets show a 4.3× improvement in FLOPs over the baseline when the FLOPs ratio is 10, and a 1.8× improvement in latency when the average latency ratio is 1.85.
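
    As a concrete illustration of the caching half, here is a minimal sketch of a Least Expected Cost (LEC) eviction rule under a simple empirical frequency estimate; the class and method names are ours, not the paper's, and GDSF would additionally normalize priority by entry size with an aging clock:

```python
# Minimal LEC cache sketch (illustrative names, not the paper's code):
# evict the cached query whose expected recomputation cost, i.e.
# (estimated arrival frequency) x (processing cost), is smallest.
from collections import defaultdict

class LECCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = {}                  # query -> cached model response
        self.counts = defaultdict(int)   # query -> empirical arrival count
        self.cost = {}                   # query -> cost to recompute (e.g. FLOPs)

    def get(self, query: str):
        self.counts[query] += 1
        return self.store.get(query)     # None signals a cache miss

    def put(self, query: str, response: str, cost: float) -> None:
        self.cost[query] = cost
        if query not in self.store and len(self.store) >= self.capacity:
            victim = min(self.store,
                         key=lambda q: self.counts[q] * self.cost[q])
            del self.store[victim]       # cheapest expected miss goes first
        self.store[query] = response

cache = LECCache(capacity=2)
for query, cost in [("q1", 10.0), ("q2", 1.0), ("q1", 10.0), ("q3", 5.0)]:
    if cache.get(query) is None:
        # On a miss, the model multiplexer would pick the cheapest adequate
        # model from the ensemble; here we fake a response.
        cache.put(query, f"response({query})", cost)
```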

    Overestimation of thermal emittance in solenoid scans due to coupled transverse motion

    The solenoid scan is a widely used method for the in-situ measurement of the thermal emittance in a photocathode gun. The popularity of this method is due to its simplicity and convenience, since all rf photocathode guns are equipped with an emittance compensation solenoid. This paper shows that the solenoid scan overestimates the thermal emittance in the ordinary measurement configuration due to a weak quadrupole field (present in either the rf gun or the gun solenoid) followed by a rotation in the solenoid. This coupled transverse dynamics aberration introduces a correlation between the beam's horizontal and vertical motion, increasing the measured 2D transverse emittance and thus overestimating the thermal emittance. This effect was systematically studied using both analytic expressions and numerical simulations. These studies were experimentally verified using an L-band 1.6-cell rf photocathode gun with a cesium telluride cathode, which showed a thermal emittance overestimation of 35% for an rms laser spot size of 2.7 mm. The paper concludes by showing that the accuracy of the solenoid scan can be improved by using a quadrupole magnet corrector consisting of a pair of normal and skew quadrupole magnets.
    Comment: 12 pages, 13 figures
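
    As a schematic of the mechanism (a simplified single-kick model assumed here, not the paper's full derivation), one cross-plane coupling term is enough to show the bias is always upward:

```latex
% The scan fits the projected 2D emittance
\[
  \epsilon_x = \sqrt{\langle x^2\rangle\langle x'^2\rangle
                     - \langle x x'\rangle^2}.
\]
% Model the weak quadrupole field plus solenoid rotation as a single
% cross-plane kick x' -> x' + \kappa y, where \kappa is an assumed
% effective coupling strength. For a beam with no initial cross-plane
% correlations (\langle xy\rangle = \langle x'y\rangle = 0), the
% measured emittance becomes
\[
  \epsilon_{x,\mathrm{meas}}^2
    = \epsilon_{x,0}^2
      + \kappa^2 \langle x^2\rangle\langle y^2\rangle
    \;\ge\; \epsilon_{x,0}^2,
\]
% so any nonzero \kappa biases the fit upward, with the bias growing
% with the laser spot size; the normal/skew quadrupole corrector pair
% cancels \kappa.
```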

    Factorized Q-Learning for Large-Scale Multi-Agent Systems

    Deep Q-learning has achieved significant success in single-agent decision-making tasks. However, it is challenging to extend Q-learning to large-scale multi-agent scenarios, due to the explosion of the action space resulting from the complex dynamics between the environment and the agents. In this paper, we propose to make the computation of multi-agent Q-learning tractable by treating the Q-function (w.r.t. state and joint action) as a high-order, high-dimensional tensor and then approximating it with factorized pairwise interactions. Furthermore, we utilize a composite deep neural network architecture to compute the factorized Q-function, share the model parameters among all the agents within the same group, and estimate the agents' optimal joint actions through a coordinate-descent-type algorithm. All these simplifications greatly reduce the model complexity and accelerate the learning process. Extensive experiments on two different multi-agent problems demonstrate the performance gain of our proposed approach in comparison with strong baselines, particularly when there are a large number of agents.
    Comment: 7 pages, 5 figures, DAI 2019
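
    A minimal sketch of the two ideas the abstract combines, with assumed shapes and names (shared bilinear terms standing in for the composite deep network): pairwise factorization of the joint Q-tensor and coordinate-descent action selection:

```python
# Illustrative sketch, not the authors' network: approximate
# Q(s, a_1..a_N) with shared pairwise interactions, so cost scales with
# O(N^2) agent pairs instead of the |A|^N joint-action space.
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, N_ACTIONS, D = 4, 5, 8

action_emb = rng.normal(size=(N_ACTIONS, D))  # shared within one agent group
W = rng.normal(size=(D, D))                   # pairwise interaction weights

def factorized_q(state_feat: np.ndarray, actions: list) -> float:
    """Sum of bilinear pairwise terms, modulated by a state feature
    (a linear stand-in for the deep network)."""
    total = 0.0
    for i in range(N_AGENTS):
        for j in range(i + 1, N_AGENTS):
            pair = action_emb[actions[i]] * action_emb[actions[j]]
            total += float(state_feat @ np.tanh(W @ pair))
    return total

def greedy_joint_action(state_feat: np.ndarray, sweeps: int = 5) -> list:
    """Coordinate descent: repeatedly update each agent's action to the
    argmax while holding the other agents' actions fixed."""
    actions = [0] * N_AGENTS
    for _ in range(sweeps):
        for i in range(N_AGENTS):
            scores = [factorized_q(state_feat,
                                   actions[:i] + [a] + actions[i + 1:])
                      for a in range(N_ACTIONS)]
            actions[i] = int(np.argmax(scores))
    return actions

print(greedy_joint_action(rng.normal(size=D)))
```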